Abstract

AUC is an important performance measure and many algorithms have been devoted to 
AUC optimization, mostly by minimizing a surrogate convex loss on a training data set. 
In this work, we focus on one-pass AUC optimization that requires going through the 
training data only once without storing the entire training dataset, where conventional 
online learning algorithms cannot be applied directly because AUC is measured by a sum of 
losses defined over pairs of instances from different classes. We develop a regression-based 
algorithm which only needs to maintain the first and second-order statistics of training 
data in memory, resulting in a storage requirement independent of the size of training
data. To efficiently handle high-dimensional data, we develop a randomized algorithm that 
approximates the covariance matrices by low-rank matrices. We verify, both theoretically 
and empirically, the effectiveness of the proposed algorithm. 

Keywords: AUC optimization, learning to rank, large-scale learning, random projection, 
square loss 



1. Introduction 



AUC (Area Under ROC curve) (Metz, 1978; Hanley and McNeil, 1983) is an important
performance measure that has been widely used in many tasks (Provost et al., 1998;
Cortes and Mohri, 2004; Liu et al., 2009; Flach et al., 2011). Many algorithms have been
developed to optimize AUC based on surrogate losses (Herschtal and Raskutti, 2004;
Joachims, 2006; Rudin and Schapire, 2009; Kotlowski et al., 2011; Zhao et al., 2011).



In this work, we focus on AUC optimization that requires only one pass of training 
examples. This is particularly important for applications involving big data or streaming 
data in which a large volume of data arrives in a short time period, making it infeasible to
store the entire data set in memory before an optimization procedure is applied. Although
many online learning algorithms have been developed to find the optimal solution of some
performance measures by scanning the training data only once (Cesa-Bianchi and Lugosi),
few efforts address one-pass AUC optimization.



Unlike the classical classification and regression problems where the loss function can 
be calculated on a single training example, AUC is measured by the losses defined over 
pairs of instances from different classes, making it challenging to develop algorithms for 
one-pass optimization. An online AUC optimization algorithm was proposed very recently
by Zhao et al. (2011). It is based on the idea of reservoir sampling, and achieves a solid
regret bound by storing only $\sqrt{T}$ instances, where $T$ is the number of training examples.
Ideally, for one-pass approaches, the storage required by the learning
process should be independent of the amount of training data, because it is often quite
difficult to predict how much data will be received in those applications.

In this work, we propose a regression-based algorithm for one-pass AUC optimization in 
which a square loss is used to measure the ranking error between two instances from different 
classes. The main advantage of using the square loss lies in the fact that it only needs to 
store the first and second-order statistics for the received training examples. Consequently, 
the storage requirement is reduced to $O(d^2)$, where $d$ is the dimension of data, independent
of the number of training examples. To deal with high-dimensional data, we develop
a randomized algorithm that approximates the $d \times d$ covariance matrix by a low-rank
matrix. We show, both theoretically and empirically, the effectiveness of our proposed
algorithm by comparing it to state-of-the-art algorithms for AUC optimization.

Section 2 introduces some preliminaries. Section 3 proposes the OPAUC (One-Pass
AUC) framework, Section 4 provides theoretical analysis, and Section 5 presents detailed
proofs. Section 6 summarizes our experimental results, and Section 7 concludes with future work.

2. Preliminaries 

We denote by $\mathcal{X} \subseteq \mathbb{R}^d$ an instance space and $\mathcal{Y} = \{+1, -1\}$ the label set, and let
$\mathcal{D}$ denote an unknown (underlying) distribution over $\mathcal{X} \times \mathcal{Y}$. A training sample of $n_+$
positive instances and $n_-$ negative ones

$$S = \{(x_1^+, +1), (x_2^+, +1), \ldots, (x_{n_+}^+, +1), (x_1^-, -1), (x_2^-, -1), \ldots, (x_{n_-}^-, -1)\}$$

is drawn identically and independently according to distribution $\mathcal{D}$, where we do not fix
$n_+$ and $n_-$ before the training sample is chosen. Let $f: \mathcal{X} \to \mathbb{R}$ be a real-valued function.
Then, the AUC of function $f$ on the sample $S$ is defined as

$$\sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \frac{I[f(x_i^+) > f(x_j^-)] + \frac{1}{2} I[f(x_i^+) = f(x_j^-)]}{n_+ n_-}$$

where $I[\cdot]$ is the indicator function which returns 1 if the argument is true and 0 otherwise.
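The definition above can be sketched directly in code. This is a minimal illustration rather than part of the proposed method; the function name `empirical_auc` and the use of NumPy are assumptions made for the example. It compares every positive score with every negative score and counts ties as one half.

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """AUC over all positive/negative pairs, counting ties as 1/2."""
    sp = np.asarray(scores_pos, dtype=float)[:, None]  # shape (n_plus, 1)
    sn = np.asarray(scores_neg, dtype=float)[None, :]  # shape (1, n_minus)
    wins = (sp > sn).sum()    # pairs ranked correctly
    ties = (sp == sn).sum()   # pairs with equal scores
    return (wins + 0.5 * ties) / (sp.size * sn.size)
```

The pairwise comparison costs O(n_+ n_-) time and memory, which is fine for illustration but is exactly the pairwise structure that makes one-pass optimization hard.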

Direct optimization of AUC often leads to an NP-hard problem as it can be cast into a 
combinatorial optimization problem. In practice, it is approximated by a convex
optimization problem that minimizes the following objective function

$$\mathcal{L}(w) = \frac{\lambda}{2}|w|^2 + \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \frac{\ell(w^\top (x_i^+ - x_j^-))}{2 n_+ n_-} \qquad (1)$$

where $\ell$ is a convex loss function and $\lambda$ is the regularization parameter that controls the
model complexity. Notice that each loss term $\ell(w^\top (x_i^+ - x_j^-))$ involves two instances from
different classes; therefore, it is difficult to extend online learning algorithms for one-pass
AUC optimization without storing all the training instances. Zhao et al. (2011) addressed
this challenge by exploiting the reservoir sampling technique.

3. The OPAUC Approach 

To address the challenge of one-pass AUC optimization, we propose to use the square loss 
in Eq. (1), that is,

$$\mathcal{L}(w) = \frac{\lambda}{2}|w|^2 + \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \frac{(1 - w^\top (x_i^+ - x_j^-))^2}{2 n_+ n_-}. \qquad (2)$$

The main advantage of using the square loss lies in the fact that it is sufficient to store the
first and second-order statistics of training examples for optimization, leading to a memory
requirement of $O(d^2)$, which is independent of the number of training examples. Another
advantage is that the square loss is consistent with AUC, as will be shown by Theorem 1
(Section 4). In contrast, loss functions such as hinge loss are proven to be inconsistent with
AUC (Gao and Zhou, 2012).

As aforementioned, the classical online setting cannot be applied to one-pass AUC
optimization because, even if the optimization problem of Eq. (2) has a closed form, it requires
going through the training examples multiple times. To address this challenge, we modify
the overall loss $\mathcal{L}(w)$ in Eq. (2) (with a little variation) as a sum of losses for individual
training instances $\sum_{t=1}^{T} \mathcal{L}_t(w)$, where

$$\mathcal{L}_t(w) = \frac{\lambda}{2}|w|^2 + \frac{\sum_{i=1}^{t-1} I[y_i \neq y_t]\,(1 - y_t (x_t - x_i)^\top w)^2}{2\,|\{i \in [t-1] : y_i y_t = -1\}|}$$

for sequence $S_t = \{(x_1, y_1), \ldots, (x_t, y_t)\}$. It is noteworthy that $\mathcal{L}_t(w)$ is an unbiased
estimation of $\mathcal{L}(w)$ for an i.i.d. sequence $S_t$. For notational simplicity, we denote by $X_t^+$ and $X_t^-$
the sets of positive and negative instances in the sequence $S_t$, respectively, and we further
denote by $T_t^+$ and $T_t^-$ their respective cardinalities. Also, we set $\mathcal{L}_t(w) = 0$ for $T_t^+ T_t^- = 0$.
If $y_t = 1$, we calculate the gradient as

$$\nabla \mathcal{L}_t(w) = \lambda w + x_t x_t^\top w - x_t + \sum_{i:\, y_i = -1} \frac{x_i + (x_i x_i^\top - x_i x_t^\top - x_t x_i^\top) w}{T_t^-}. \qquad (3)$$

It is easy to observe that

$$c_t^- = \sum_{i:\, y_i = -1} \frac{x_i}{T_t^-} \quad \text{and} \quad S_t^- = \sum_{i:\, y_i = -1} \frac{x_i x_i^\top - c_t^- [c_t^-]^\top}{T_t^-}$$
Algorithm 1 The OPAUC Algorithm

Input: The regularization parameter $\lambda > 0$ and stepsizes $\{\eta_t\}_{t=1}^{T}$.

Initialization: Set $T_0^+ = T_0^- = 0$, $c_0^+ = c_0^- = 0$, $w_0 = 0$ and $\Gamma_0^+ = \Gamma_0^- = [0]_{d \times u}$ for some $u > 0$.

 1: for $t = 1, 2, \ldots, T$ do
 2:   Receive a training example $(x_t, y_t)$
 3:   if $y_t = +1$ then
 4:     $T_t^+ = T_{t-1}^+ + 1$ and $T_t^- = T_{t-1}^-$;
 5:     $c_t^+ = c_{t-1}^+ + \frac{1}{T_t^+}(x_t - c_{t-1}^+)$ and $c_t^- = c_{t-1}^-$;
 6:     Update $\Gamma_t^+$ and $\Gamma_t^- = \Gamma_{t-1}^-$;
 7:     Calculate the gradient $g_t(w_{t-1})$
 8:   else
 9:     $T_t^- = T_{t-1}^- + 1$ and $T_t^+ = T_{t-1}^+$;
10:     $c_t^- = c_{t-1}^- + \frac{1}{T_t^-}(x_t - c_{t-1}^-)$ and $c_t^+ = c_{t-1}^+$;
11:     Update $\Gamma_t^-$ and $\Gamma_t^+ = \Gamma_{t-1}^+$;
12:     Calculate the gradient $g_t(w_{t-1})$
13:   end if
14:   $w_t = w_{t-1} - \eta_t g_t(w_{t-1})$
15: end for



correspond to the mean and covariance matrix of the negative class, respectively; thus, Eq. (3)
can be further simplified as

$$\nabla \mathcal{L}_t(w) = \lambda w - x_t + c_t^- + (x_t - c_t^-)(x_t - c_t^-)^\top w + S_t^- w. \qquad (4)$$

In a similar manner, we calculate the following gradient for $y_t = -1$:

$$\nabla \mathcal{L}_t(w) = \lambda w + x_t - c_t^+ + (x_t - c_t^+)(x_t - c_t^+)^\top w + S_t^+ w \qquad (5)$$

where

$$c_t^+ = \sum_{i:\, y_i = 1} \frac{x_i}{T_t^+} \quad \text{and} \quad S_t^+ = \sum_{i:\, y_i = 1} \frac{x_i x_i^\top - c_t^+ [c_t^+]^\top}{T_t^+}$$

are the mean and covariance matrix of the positive class, respectively.

The storage cost for keeping the class means ($c_t^+$ and $c_t^-$) and covariance matrices ($S_{t-1}^+$
and $S_{t-1}^-$) is $O(d^2)$. Once we get the gradient $\nabla \mathcal{L}_t(w)$, by the theory of stochastic gradient
descent, the solution can be updated by

$$w_{t+1} = w_t - \eta_t \nabla \mathcal{L}_t(w_t)$$

where $\eta_t$ is the stepsize for the $t$-th iteration.
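To see why the class mean and covariance suffice, the following sketch compares the gradient for $y_t = +1$ computed pair by pair over all stored negatives with the same gradient computed from the negative-class statistics alone. The function names and the NumPy usage are assumptions made for this illustration; the covariance is the biased (divide by count) version used in the text.

```python
import numpy as np

def pairwise_grad(w, x_t, negatives, lam):
    """Gradient of L_t for y_t = +1, computed pair by pair (needs all negatives)."""
    diffs = x_t - negatives                 # row i is x_t - x_i
    residual = 1.0 - diffs @ w              # 1 - w^T (x_t - x_i)
    return lam * w - (diffs * residual[:, None]).mean(axis=0)

def stats_grad(w, x_t, c_neg, S_neg, lam):
    """Same gradient from the negative mean c_neg and covariance S_neg only."""
    u = x_t - c_neg
    return lam * w - u + u * (u @ w) + S_neg @ w
```

Both functions return the same vector up to floating-point error, but only the second can be evaluated without keeping the negative instances in memory.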

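Algorithm 1 highlights the key steps of the proposed algorithm. We initialize $\Gamma_0^+ =
\Gamma_0^- = [0]_{d \times d}$, where $u = d$. At each iteration, we set $\Gamma_t^+ = S_t^+$ and $\Gamma_t^- = S_t^-$, and update
$\Gamma_t^+$ (Line 6) and $\Gamma_t^-$ (Line 11), respectively, by using the following equations

$$\Gamma_t^+ = \Gamma_{t-1}^+ + \frac{x_t x_t^\top - \Gamma_{t-1}^+ - c_{t-1}^+ [c_{t-1}^+]^\top}{T_t^+} + c_{t-1}^+ [c_{t-1}^+]^\top - c_t^+ [c_t^+]^\top,$$

$$\Gamma_t^- = \Gamma_{t-1}^- + \frac{x_t x_t^\top - \Gamma_{t-1}^- - c_{t-1}^- [c_{t-1}^-]^\top}{T_t^-} + c_{t-1}^- [c_{t-1}^-]^\top - c_t^- [c_t^-]^\top.$$

Finally, the stochastic gradient $g_t(w_{t-1})$ of Lines 7 and 12 in Algorithm 1 is given by
$\nabla \mathcal{L}_t(w_{t-1})$, calculated by Eqs. (4) and (5), respectively.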
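The one-pass procedure with full covariance matrices can be sketched as follows. This is a hedged illustration, not the authors' implementation: the names `ClassStats` and `opauc_fit`, the constant stepsize `eta`, and the default parameter values are assumptions for the example. Each class keeps a count, a running mean, and a running (biased) covariance updated by the recursion for $\Gamma_t^{\pm}$, and the gradient for an incoming example uses only the statistics of the opposite class.

```python
import numpy as np

class ClassStats:
    """Running count, mean and biased covariance of one class."""
    def __init__(self, d):
        self.n = 0
        self.mean = np.zeros(d)
        self.cov = np.zeros((d, d))

    def update(self, x):
        self.n += 1
        old = self.mean
        new = old + (x - old) / self.n
        # Covariance recursion: rescale the second moment, then re-center.
        self.cov = (self.cov + np.outer(old, old) - np.outer(new, new)
                    + (np.outer(x, x) - self.cov - np.outer(old, old)) / self.n)
        self.mean = new

def opauc_fit(stream, d, lam=1e-3, eta=0.05):
    """One pass over (x, y) pairs with y in {+1, -1}; returns the weight vector."""
    w = np.zeros(d)
    stats = {+1: ClassStats(d), -1: ClassStats(d)}
    for x, y in stream:
        x, y = np.asarray(x, dtype=float), int(y)
        other = stats[-y]
        if other.n > 0:                      # loss is zero until both classes are seen
            u = y * (x - other.mean)         # Eq. (4) for y = +1, Eq. (5) for y = -1
            grad = lam * w - u + u * (u @ w) + other.cov @ w
            w = w - eta * grad
        stats[y].update(x)
    return w
```

Memory use is two d-by-d matrices and two d-vectors regardless of how many examples pass through, which is the point of the square-loss formulation.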

Dealing with High-Dimensional Data. One limitation of the approach in Algorithm 1
is that the storage cost of the two covariance matrices $S_t^+$ and $S_t^-$ is $O(d^2)$, making it
unsuitable for high-dimensional data. We tackle this by developing a randomized algorithm
that approximates the covariance matrices by low-rank matrices. We are motivated by the
observation that $S_t^+$ and $S_t^-$ can be written, respectively, as

$$S_t^+ = \frac{1}{T_t^+} \left(X_t^+ - c_t^+ 1_{T_t^+}^\top\right) I_{T_t^+} \left(X_t^+ - c_t^+ 1_{T_t^+}^\top\right)^\top,$$

$$S_t^- = \frac{1}{T_t^-} \left(X_t^- - c_t^- 1_{T_t^-}^\top\right) I_{T_t^-} \left(X_t^- - c_t^- 1_{T_t^-}^\top\right)^\top,$$

where $I_t$ is an identity matrix of size $t \times t$ and $1_t$ is an all-one vector of size $t$. To approximate
$S_t^+$ and $S_t^-$, we approximate the identity matrix $I_t$ by a matrix of rank $r \ll d$. To this
end, we randomly sample $r_i \in \mathbb{R}^r$, $i = 1, \ldots, t$, from a Gaussian distribution $\mathcal{N}(0, I_r)$,
and approximate $I_t$ by $R_t R_t^\top$, where $R_t = \frac{1}{\sqrt{r}}(r_1, \ldots, r_t)^\top \in \mathbb{R}^{t \times r}$. We further divide $R_t$
into two matrices $R_t^+ \in \mathbb{R}^{T_t^+ \times r}$ and $R_t^- \in \mathbb{R}^{T_t^- \times r}$ that contain the subset of the
rows in $R_t$ corresponding to all the positive and negative instances received before the $t$-th
iteration, respectively. Therefore, the covariance matrices $S_t^+$ and $S_t^-$ can be approximated,
respectively, by

$$\hat{S}_t^+ = \frac{1}{T_t^+} Z_t^+ [Z_t^+]^\top - \hat{c}_t^+ [\hat{c}_t^+]^\top \quad \text{and} \quad \hat{S}_t^- = \frac{1}{T_t^-} Z_t^- [Z_t^-]^\top - \hat{c}_t^- [\hat{c}_t^-]^\top$$

where

$$Z_t^+ = X_t^+ R_t^+, \qquad \hat{c}_t^+ = c_t^+ 1_{T_t^+}^\top R_t^+ / \sqrt{T_t^+},$$

$$Z_t^- = X_t^- R_t^-, \qquad \hat{c}_t^- = c_t^- 1_{T_t^-}^\top R_t^- / \sqrt{T_t^-}.$$

Based on the approximate covariance matrices $\hat{S}_t^{\pm}$, the approximation algorithm essentially tries
to minimize $\sum_{t=1}^{T} \hat{\mathcal{L}}_t(w)$, where

$$\hat{\mathcal{L}}_t(w) = w^\top (c_{t-1}^- - x_t) + \frac{1}{2}\left(1 + w^\top \hat{S}_t^- w\right) + \frac{\lambda}{2}|w|^2 + \frac{1}{2} w^\top (x_t - c_{t-1}^-)(x_t - c_{t-1}^-)^\top w \qquad (6)$$

if $y_t = 1$; otherwise,

$$\hat{\mathcal{L}}_t(w) = w^\top (x_t - c_{t-1}^+) + \frac{1}{2}\left(1 + w^\top \hat{S}_t^+ w\right) + \frac{\lambda}{2}|w|^2 + \frac{1}{2} w^\top (x_t - c_{t-1}^+)(x_t - c_{t-1}^+)^\top w. \qquad (7)$$

Further, we have the following recursive formulas:

$$Z_t^+ = Z_{t-1}^+ + x_t r_t^\top I[y_t = +1] / \sqrt{r}, \qquad (8)$$

$$Z_t^- = Z_{t-1}^- + x_t r_t^\top I[y_t = -1] / \sqrt{r}. \qquad (9)$$

It is important to notice that we do not need to calculate and store the approximate
covariance matrices $\hat{S}_t^+$ and $\hat{S}_t^-$ explicitly. Instead, we only need to maintain matrices $Z_t^+$ and
$Z_t^-$ in memory. This is because the stochastic gradient $\hat{g}_t(w)$ based on the approximate
covariance matrices can be computed directly from $Z_t^+$ and $Z_t^-$. More specifically, $\hat{g}_t(w)$ is
computed as

$$\hat{g}_t(w) = c_{t-1}^- - x_t + \lambda w + (x_t - c_{t-1}^-)(x_t - c_{t-1}^-)^\top w + \left(Z_t^- [Z_t^-]^\top / T_t^- - \hat{c}_{t-1}^- [\hat{c}_{t-1}^-]^\top\right) w \qquad (10)$$

for $y_t = 1$; otherwise

$$\hat{g}_t(w) = x_t - c_{t-1}^+ + \lambda w + (x_t - c_{t-1}^+)(x_t - c_{t-1}^+)^\top w + \left(Z_t^+ [Z_t^+]^\top / T_t^+ - \hat{c}_{t-1}^+ [\hat{c}_{t-1}^+]^\top\right) w. \qquad (11)$$

We require a memory of $O(rd)$ instead of $O(d^2)$ to calculate $\hat{g}_t(w)$ by using the trick
$A[A]^\top w = A([A]^\top w)$, where $A \in \mathbb{R}^{d \times 1}$ or $\mathbb{R}^{d \times r}$.

To implement the approximate approach, we initialize $\Gamma_0^+ = \Gamma_0^- = [0]_{d \times r}$ in Algorithm 1,
where $u = r$. At each iteration, we set $\Gamma_t^+ = Z_t^+$ and $\Gamma_t^- = Z_t^-$, and compute the gradient
$\hat{g}_t(w_{t-1})$ of Lines 7 and 12 in Algorithm 1 by Eqs. (10) and (11), respectively. $\Gamma_t^+$ and
$\Gamma_t^-$ are updated by Eqs. (8) and (9), respectively.
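The memory saving can be illustrated with a small sketch. The class name `LowRankClassStats` and the function `approx_grad` are hypothetical, and one simplification is made and should be read as an assumption: the centering term uses the exact class mean (a d-vector, which is cheap to store) instead of the sketched mean term of the text. The sketch matrix `Z` is updated as in Eqs. (8) and (9), and the product of the approximate covariance with `w` is evaluated as `Z @ (Z.T @ w)` so that no d-by-d matrix is ever formed.

```python
import numpy as np

class LowRankClassStats:
    """Per-class count, mean and d x r Gaussian sketch Z of the second moment."""
    def __init__(self, d, r, rng):
        self.n = 0
        self.mean = np.zeros(d)
        self.Z = np.zeros((d, r))
        self.r = r
        self.rng = rng

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        r_t = self.rng.standard_normal(self.r)       # fresh Gaussian row for x
        self.Z += np.outer(x, r_t) / np.sqrt(self.r)  # Eq. (8) / Eq. (9)

    def cov_times(self, w):
        """(Z Z^T / n - mean mean^T) w in O(rd) time, without a d x d matrix."""
        return self.Z @ (self.Z.T @ w) / self.n - self.mean * (self.mean @ w)

def approx_grad(w, x, y, other, lam):
    """Approximate gradient for (x, y) using the opposite class statistics."""
    u = y * (x - other.mean)
    return lam * w - u + u * (u @ w) + other.cov_times(w)
```

The order of multiplication in `cov_times` is the whole trick: `Z.T @ w` is an r-vector, so the cost is O(rd) instead of O(d^2).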

Remark. An alternative approach for the high-dimensional case is through random
projection (Johnstone, 2006; Hsu et al., 2012). Let $H \in \mathbb{R}^{d \times r}$ be a random Gaussian
matrix, where $r \ll d$. By performing random projection using $H$, we compute a
low-dimensional representation for each instance $x_t$ as $\hat{x}_t = H^\top x_t \in \mathbb{R}^r$ and will only maintain
covariance matrices of size $r \times r$ in memory. Although it is computationally attractive,
this approach performs significantly worse than the randomized low-rank approximation
algorithm, according to our empirical study. This may be due to the fact that the random
projection approach is equivalent to approximating $S_t^{\pm} = I_d S_t^{\pm} I_d$ by $H H^\top S_t^{\pm} H H^\top$, which
replaces both the left and right identity matrices of $S_t^{\pm}$ with $H H^\top$. In contrast, our proposed
approach only approximates one identity matrix in $S_t^{\pm}$, making it more reliable for tackling
high-dimensional data.



4. Main Theoretical Result 

In this section, we present the main theoretical results for our proposed algorithm. The 
following theorem shows the consistency of square loss, and the detailed proof is deferred
to Section 5.1.

Theorem 1 For square loss $\ell(t) = (1 - t)^2$, the surrogate loss $\Psi(f, x, x') = \ell(f(x) - f(x'))$
is consistent with AUC.

Define $w_*$ as

$$w_* = \arg\min_{w} \sum_{t} \mathcal{L}_t(w).$$

We are in the position to present the following convergence rate for Algorithm 1 when the
full covariance matrices are provided, and the detailed proof is deferred to Section 5.2.

Theorem 2 For $\|x_t\| \le 1$ ($t \in [T]$), $\|w_*\| \le B$ and $T L^* \ge \sum_{t=1}^{T} \mathcal{L}_t(w_*)$, we have

$$\sum_{t} \mathcal{L}_t(w_t) - \sum_{t} \mathcal{L}_t(w_*) \le 2 \kappa B^2 + B \sqrt{2 \kappa T L^*},$$

where $\kappa = 4 + \lambda$ and $\eta_t = 1 / \left(\kappa + \sqrt{\kappa^2 + \kappa T L^* / B^2}\right)$.
This theorem presents an $O(1/T)$ convergence rate for the OPAUC algorithm if the
distribution is separable, i.e., $L^* = 0$, and an $O(1/\sqrt{T})$ convergence rate for the general case.
Compared to the online AUC optimization algorithm (Zhao et al., 2011), which achieves
at most an $O(1/\sqrt{T})$ convergence rate, our proposed algorithm clearly reduces the regret. The
faster convergence rate of our proposed algorithm owes to the smoothness of the square
loss, an important property that has been explored by some studies of online learning
(Rakhlin et al., 2012) and generalization error bound analysis (Srebro et al., 2010).

Remark: The bound in Theorem 2 does not explicitly exploit the strong convexity
of $\mathcal{L}_t(w)$, which can lead to an $O(1/T)$ convergence rate. Instead, we focus on exploiting
the smoothness of the loss function, since we did not introduce a bounded domain for $w$.
Due to the regularizer $\lambda |w|^2 / 2$, we have $|w_*| \le 1/\lambda$, and it is reasonable to restrict $w_t$ by
$|w_t| \le 1/\lambda$, leading to a regret bound of $O(\ln T / [\lambda^3 T])$ by applying the standard stochastic
gradient descent with $\eta_t = 1/[\lambda t]$. This bound is preferred only when $\lambda = \Omega(T^{-1/6})$, a
scenario which rarely occurs in empirical study. This problem may also be addressable by
exploiting the epoch gradient method (Nocedal and Wright, 1999), a subject of our future
study.

We now consider the case when covariance matrices are approximated by low-rank
matrices. Note that the low-rank approximation is accurate only if the eigenvalues of covariance
matrices follow a skewed distribution. To capture the skewed eigenvalue distribution, we
introduce the concept of effective numerical rank (Hansen, 1987) that generalizes the rank
of a matrix:

Definition 3 For a positive constant $\mu > 0$ and positive semi-definite matrix $M \in \mathbb{R}^{d \times d}$
with eigenvalues $\{\nu_i\}$, the effective numerical rank w.r.t. $\mu$ is defined to be
$r(M, \mu) = \sum_{i=1}^{d} \nu_i / (\mu + \nu_i)$.

It is evident that the effective numerical rank is upper bounded by the true rank, i.e.,
$r(M, \mu) \le \mathrm{rank}(M)$. To further see how the concept of effective numerical rank captures
the skewed eigenvalue distribution, consider a PSD matrix $M$ of full rank with $\sum_{i=k}^{d} \nu_i \le \mu$
for small $k$. It is easy to verify that $r(M, \mu) \le k$, i.e., $M$ can be well approximated by a
matrix of rank $k$.
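A short numerical sketch may help here. It assumes the definition $r(M, \mu) = \sum_i \nu_i / (\mu + \nu_i)$ over the eigenvalues $\nu_i$ of $M$; the function name `effective_rank` and the example matrices are illustrative assumptions only.

```python
import numpy as np

def effective_rank(M, mu):
    """Effective numerical rank of a symmetric PSD matrix M w.r.t. mu > 0."""
    eigvals = np.linalg.eigvalsh(M)          # eigenvalues of a symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative round-off
    return float(np.sum(eigvals / (mu + eigvals)))
```

For a full-rank matrix whose trailing eigenvalues are tiny, such as `np.diag([100.0, 1e-3, 1e-3])` with `mu = 0.01`, the value stays below 2 even though the true rank is 3, which matches the skewed-spectrum intuition in the text.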

Define the effective numerical rank for a set of matrices $\{M_t\}_{t=1}^{T}$ as

$$r(\{M_t\}_{t=1}^{T}, \mu) = \max_{1 \le t \le T} r(M_t, \mu).$$

Under the assumption that the effective numerical rank for the set of covariance matrices
$\{S_t^{\pm}\}_{t=1}^{T}$ is small (i.e., $S_t^{\pm}$ can be well approximated by low-rank matrices), the following
theorem gives the convergence rate for $|\sum_t \hat{\mathcal{L}}_t(w_t) - \sum_t \mathcal{L}_t(w_*)|$, where $\hat{\mathcal{L}}_t(w_t)$ are given
by Eqs. (6) and (7).

Theorem 4 Let $\bar{r} = r(\{S_t^{\pm}\}_{t=1}^{T}, \lambda)$ be the effective numerical rank for the sequence of
covariance matrices $\{S_t^{\pm}\}_{t=1}^{T}$. For $0 < \delta < 1$, $0 < \epsilon \le 1/2$, $|w_*| \le B$, $\|x_t\| \le 1$ ($t \in [T]$)
and $T L^* \ge \sum_{t=1}^{T} \mathcal{L}_t(w_*)$, we have with probability at least $1 - \delta$,

$$\left|\sum_{t} \left(\hat{\mathcal{L}}_t(w_t) - \mathcal{L}_t(w_*)\right)\right| \le 2 \epsilon T L^* + 2 \kappa B^2 + B \sqrt{2 \kappa T L^*}$$

provided $r \ge \frac{32 \bar{r}}{\epsilon^2} \log \frac{2dT}{\delta}$, where $\kappa = 4 + \lambda$ and $\eta_t = 1 / \left(\kappa + \sqrt{\kappa^2 + \kappa T L^* / B^2}\right)$.

The detailed proof is presented in Section 5.3. For a separable distribution ($L^* = 0$), we
also obtain an $O(1/T)$ convergence rate when the covariance matrices are approximated by
low-rank matrices. Compared with Theorem 2, Theorem 4 introduces an additional term
$2 \epsilon T L^*$ in the bound when using the approximate covariance matrices, and it is noteworthy
that the approximation does not significantly increase the bound of Theorem 2 if $2 \epsilon T L^* \le
B \sqrt{2 (4 + \lambda) T L^*}$, i.e., $\epsilon \le B \sqrt{(\lambda + 4) / (2 T L^*)}$. This implies that the approximate algorithm
will achieve similar performance to the one using the full covariance matrices provided
$r = \Omega(\bar{r}\, T (\log d + \log T) / (\lambda + 4))$. When $L^* = O(1/T)$, this requirement is reduced to
$r = \Omega(\bar{r} [\log d + \log T])$, a logarithmic dependence on dimension $d$.

5. Proofs 

In this section, we present detailed proofs for our main theorems. 
5.1 Proof of Theorem 1

Let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ with instance-marginal probability $p_i = \Pr[x_i]$ and conditional
probability $\xi_i = \Pr[y = +1 \mid x_i]$, and we denote by the expected risk

$$R_\Psi(f) = C_0 + \sum_{i \neq j} p_i p_j \left(\xi_i (1 - \xi_j)\, \ell(f(x_i) - f(x_j)) + \xi_j (1 - \xi_i)\, \ell(f(x_j) - f(x_i))\right)$$

where $\ell(t) = (1 - t)^2$ and $C_0$ is a constant with respect to $f(x_i)$ ($1 \le i \le n$). According to
the analysis of (Gao and Zhou, 2012), it suffices to prove that, for every optimal solution
$f$, i.e., $R_\Psi(f) = \inf_{f'} R_\Psi(f')$, we have $f(x_i) > f(x_j)$ if $\xi_i > \xi_j$.

If $\mathcal{X} = \{x_1, x_2\}$, then minimizing $R_\Psi(f)$ gives the optimal solution $f = (f(x_1), f(x_2))$
such that

$$f(x_1) - f(x_2) = \mathrm{sgn}(\xi(x_1) - \xi(x_2)) \quad \text{for } \xi(x_1) \neq \xi(x_2),$$

which shows the consistency of least square loss.

For $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ with $n \ge 3$, if $\xi_i (1 - \xi_i) = 0$ for every $1 \le i \le n$, then
minimizing $R_\Psi(f)$ gives the optimal solution $f = (f_1, f_2, \ldots, f_n)$ such that

$$f_j = f_i + 1 \quad \text{for every } \xi_j = 1 \text{ and } \xi_i = 0,$$

which shows the consistency of least square loss.

If $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ with $n \ge 3$, and there exists some $i_0$ s.t. $\xi_{i_0} (1 - \xi_{i_0}) \neq 0$, then
the subgradient conditions give an optimal solution such that

$$\sum_{k \neq i} p_k (\xi_i + \xi_k - 2 \xi_i \xi_k)(f(x_i) - f(x_k)) = \sum_{k \neq i} p_k (\xi_i - \xi_k) \quad \text{for each } 1 \le i \le n.$$

Solving the above $n$ linear equations, we obtain the optimal solution $f = (f_1, f_2, \ldots, f_n)$,
i.e.,



£ *i>° Pi' • • -Pn i r(si,S2, • • • ,S r 

siH hs n — ra— 2 



where $\Gamma$ is a polynomial in $\xi_{k_1} + \xi_{k_2} - 2 \xi_{k_1} \xi_{k_2}$ for $1 \le k_1, k_2 \le n$. In the following,
we will derive the specific expression for $\Gamma(s_1, s_2, \ldots, s_n)$. Denote by $\mathcal{A} = \{i : s_i \ge 1\}$ and
$\mathcal{B} = \{i : s_i = 0\} = \{b_1, b_2, \ldots, b_{|\mathcal{B}|}\}$.
If $|\mathcal{A}| = 1$, i.e., $\mathcal{A} = \{i_1\}$ for some $1 \le i_1 \le n$, then

r(si,s 2 ,--- ,s n ) = + 6-2^16)- 

keB 

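If $|\mathcal{A}| = 2$, i.e., $\mathcal{A} = \{i_1, i_2\}$ for some $1 \le i_1, i_2 \le n$, then we denote by

$$\mathcal{A}_1 = \{s_{i_1} \otimes i_1\} \cup \{s_{i_2} \otimes i_2\}$$

where $\{s_{i_k} \otimes i_k\}$ denotes the multi-set $\{i_k, i_k, \ldots, i_k\}$ of size $s_{i_k}$ for $k = 1, 2$. It is clear
that $|\mathcal{B}| = |\mathcal{A}_1| = n - 2$. Further, we denote by $\mathcal{G}(\mathcal{A}_1)$ the set of all permutations of
$\mathcal{A}_1$. Therefore, we have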

n-2 

r(si, 82, ■ ■ ■ , s n ) = (e tl + & - 2£ n & 2 ) J2 lite* + & - 2 ^)- 

7r=7ri---7r n _ 2 ee(^i) fc=l 

If |„4| > 2, then, for ii / i 2 G .4., we denote by the multi-set 

*2) = © *i} U*t*« © *a} (J I U {(s k -l)ek} 

\keA\{ii,t2} 

and it is easy to derive |„4i| = \B\. Further, we denote by Q(A \ {11,12}) an d G{Ai) 
the set of all permutations of A \ {ii, 12} and Ai, respectively. Therefore, we set 

Pi {h , %2 , A) = J2 (6l + ~ 2& CtT! ) (&T1 + Zn 2 - &r 2 ) 

7r=7ri7r2---7r|^|_ 2 eS(-4\{ii,i2}) 
X ' ' ' X (^7T| A |_ 3 + i-n\ A \-2 ~ 2?7r| A |_3?7r| A |_ 2 )(?7r| A |_ 2 + £i 2 ~ 2£j 2 £7T| A |_ 2 ), 

and we have 

\B\ 



V{si,s 2 ,--- ,s n ) = r i(«i>*2,^) Y n(^ fc - 2^ fc 6 



»l^»2 7r=7ri7r 2 ...7ri B iG6(^li) fc=l 

»i,»2e^ 



where $\mathcal{B} = \{b_1, b_2, \ldots, b_{|\mathcal{B}|}\}$.
Since there exists some $i_0$ s.t. $\xi_{i_0} (1 - \xi_{i_0}) \neq 0$, we have



£ »i>o •••Pn n r(si,s 2 ,--- ,S n ) 

s^H hsn— n — 2 



> 0. 



Therefore, it is evident that $f(x_i) > f(x_j)$ if $\xi_i > \xi_j$, and this theorem follows as desired.
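5.2 Proof of Theorem 2

This proof is motivated by (Shalev-Shwartz, 2007; Srebro et al., 2010). Recall

$$\mathcal{L}_t(w) = \frac{\lambda}{2}|w|^2 + \frac{\sum_{i=1}^{t-1} I[y_i \neq y_t]\,(1 - y_t (x_t - x_i)^\top w)^2}{2\,|\{i \in [t-1] : y_i \neq y_t\}|}.$$

For $|w_*| \le B$ and convex $\mathcal{L}_t(w)$, we have

$$\mathcal{L}_t(w_t) - \mathcal{L}_t(w_*) \le \nabla \mathcal{L}_t(w_t)^\top (w_t - w_*). \qquad (12)$$

It is easy to derive that $\nabla \mathcal{L}_t(w_t)$ equals

$$\lambda w_t + \frac{\sum_{i=1}^{t-1} I[y_i \neq y_t] \left((x_t - x_i)(x_t - x_i)^\top w_t - y_t (x_t - x_i)\right)}{|\{i \in [t-1] : y_i \neq y_t\}|}$$

and therefore, for any $w$ and $|x_i| \le 1$,

$$|\nabla \mathcal{L}_t(w_t) - \nabla \mathcal{L}_t(w)| \le (4 + \lambda) |w_t - w|.$$

Denote by

$$w_t^* = \arg\min_{w} \mathcal{L}_t(w),$$

which implies that $\nabla \mathcal{L}_t(w_t^*) = 0$ for convex and smooth $\mathcal{L}_t$. Based on (Nesterov,
Theorem 2.1.5), we have

$$|\nabla \mathcal{L}_t(w_t)|^2 = |\nabla \mathcal{L}_t(w_t) - \nabla \mathcal{L}_t(w_t^*)|^2 \le 2 (4 + \lambda) \mathcal{L}_t(w_t) \qquad (13)$$

where the inequality holds from $\mathcal{L}_t(w_t^*) \ge 0$ and $\nabla \mathcal{L}_t(w_t^*) = 0$. Moreover, we have

$$|w_{t+1} - w_*|^2 = |w_t - \eta_t \nabla \mathcal{L}_t(w_t) - w_*|^2 = |w_t - w_*|^2 - 2 \eta_t \nabla \mathcal{L}_t(w_t)^\top (w_t - w_*) + \eta_t^2 |\nabla \mathcal{L}_t(w_t)|^2,$$

and this yields that, by using Eqs. (12) and (13),

$$(1 - (4 + \lambda) \eta_t) \mathcal{L}_t(w_t) - \mathcal{L}_t(w_*) \le \frac{1}{2 \eta_t} |w_t - w_*|^2 - \frac{1}{2 \eta_t} |w_{t+1} - w_*|^2.$$

Summing over $t = 0, \ldots, T - 1$ and rearranging, we obtain

$$\sum_{t=0}^{T-1} (1 - (4 + \lambda) \eta_t) \mathcal{L}_t(w_t) - \sum_{t=0}^{T-1} \mathcal{L}_t(w_*) \le \frac{1}{2 \eta_0} |w_0 - w_*|^2 - \frac{1}{2 \eta_{T-1}} |w_T - w_*|^2 + \sum_{t=0}^{T-2} \left(\frac{1}{2 \eta_{t+1}} - \frac{1}{2 \eta_t}\right) |w_{t+1} - w_*|^2.$$

By setting $\eta_t = \eta$, we have

$$\frac{1}{2 \eta_0} |w_0 - w_*|^2 - \frac{1}{2 \eta_{T-1}} |w_T - w_*|^2 \le \frac{1}{2 \eta} |w_0 - w_*|^2 \le \frac{B^2}{2 \eta}$$

from $w_0 = 0$ and $|w_*| \le B$, and we further get

$$\sum_{t=0}^{T-1} \mathcal{L}_t(w_t) - \sum_{t=0}^{T-1} \mathcal{L}_t(w_*) \le \frac{B^2}{2 \eta (1 - (4 + \lambda) \eta)} + \frac{(4 + \lambda) \eta}{1 - (4 + \lambda) \eta} \sum_{t=0}^{T-1} \mathcal{L}_t(w_*).$$

This theorem holds by putting

$$\eta = \frac{1}{4 + \lambda + \sqrt{(4 + \lambda)^2 + (4 + \lambda) T L^* / B^2}}$$

into the above formula and using the formula $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$. ■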

5.3 Proof of Theorem 4

Before the detailed proof of Theorem 4, we begin with some useful results:

Lemma 5 Let $S_1 = \mathrm{diag}(s_{1i})$ and $S_2 = \mathrm{diag}(s_{2i})$ be two $d \times d$ diagonal matrices such that
$s_{1i} \neq 0$ and $s_{1i}^2 + s_{2i}^2 = 1$ for all $i$. For a Gaussian random matrix $R \in \mathbb{R}^{d \times \tau}$, we set
$Z = S_1 S_1 + S_2 R R^\top S_2$ and $r = \sum_i s_{2i}^2$, and the following hold

$$\Pr[\lambda_1(Z) \ge 1 + \epsilon] \le d \exp(-\tau \epsilon^2 / 32 r) \quad \text{and} \quad \Pr[\lambda_d(Z) \le 1 - \epsilon] \le d \exp(-\tau \epsilon^2 / 32 r),$$

where $\lambda_k(Z)$ denotes the $k$-th largest eigenvalue of matrix $Z$.

Proof This proof technique is motivated by (Gittens and Tropp, 2011), with the addition of a bias
matrix. Let g(8) = 2 e 2g . Then, we have 

Pr[Ai(M) > 1 + e] 

< inf trexpjfl (jSrfi + E[S 2 RR T S 2 ] - (1 + e)/) + ^^[(^i^) 2 ]^} 

< inf trexp{-#e + 8c/(#)tr(Sf),Sf} 

< inf dexp{-6e + 8r#(0)} < dexp(-re 2 /32r). 

6»>0 

In a similar manner, 

Pr[A p (M) < 1 - e] 

< inf tr exp \e(SiSt + E[S 2 RR T S 2 ] - (1 - e)l) + g(6)E[(S 2 RR T S 2 ) 2 ]/t\ 

< inf dexp{-6>e + 8r#(6>)} < dexp(-re 2 /32r). 

0>O 

This completes the proof. ■ 

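Let $M \in \mathbb{R}^{d \times d}$ be a positive semi-definite (PSD) matrix with effective numerical rank
$r(M, \mu)$ for $\mu > 0$. We define two matrices $K$ and $\hat{K}$, respectively, as

$$K := \mu I_d + M \quad \text{and} \quad \hat{K} := \mu I_d + M^{1/2} R R^\top M^{1/2},$$

where $R \in \mathbb{R}^{d \times \tau}$ is a (Gaussian) random matrix. Based on Lemma 5, we have the following
lemma that bounds the difference between $K$ and $\hat{K}$:

Lemma 6 Let $r(M, \mu)$ be the numerical rank for $\mu > 0$ and PSD matrix $M$. Then, for
$\delta > 0$ and $\epsilon > 0$, the following holds with probability at least $1 - \delta$

$$\|I - K^{-1/2} \hat{K} K^{-1/2}\|_2 \le \epsilon,$$

where $\|Z\|_2$ measures the spectral norm of matrix $Z$, provided

$$\tau \ge \frac{32\, r(M, \mu)}{\epsilon^2} \log \frac{2d}{\delta}.$$

Proof Let $M = U \mathrm{diag}(\sigma_i^2) V^\top$ be the singular value decomposition of $M$. We define

$$S_1 = \mathrm{diag}\left(\sqrt{\frac{\mu}{\mu + \sigma_1^2}}, \ldots, \sqrt{\frac{\mu}{\mu + \sigma_d^2}}\right) \quad \text{and} \quad S_2 = \mathrm{diag}\left(\sqrt{\frac{\sigma_1^2}{\mu + \sigma_1^2}}, \ldots, \sqrt{\frac{\sigma_d^2}{\mu + \sigma_d^2}}\right).$$

It is easy to observe that

$$Z = K^{-1/2} \hat{K} K^{-1/2} = U \left(S_1^2 + S_2 V^\top R R^\top V S_2\right) U^\top = U \left(S_1^2 + S_2 \tilde{R} \tilde{R}^\top S_2\right) U^\top$$

where $\tilde{R} = V^\top R \in \mathbb{R}^{d \times \tau}$ is also a Gaussian random matrix because $V$ is an orthonormal
matrix. Parameter $r$ in Lemma 5 is given by

$$r = \sum_{i=1}^{d} \frac{\sigma_i^2}{\mu + \sigma_i^2} = r(M, \mu).$$

Using Lemma 5, the following hold with probability at least $1 - \delta$,

$$\lambda_{\max}(Z) = \|K^{-1/2} \hat{K} K^{-1/2}\|_2 \le 1 + \epsilon \quad \text{and} \quad \lambda_{\min}(Z) = \lambda_d(K^{-1/2} \hat{K} K^{-1/2}) \ge 1 - \epsilon,$$

which yields that $\|Z - I\| \le \epsilon$ provided

$$\tau \ge \frac{32 r}{\epsilon^2} \log \frac{2d}{\delta}.$$

This lemma follows as desired. ■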
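Recall that

$$\hat{\mathcal{L}}_t(w) = \frac{\lambda}{2}|w|^2 + w^\top (c_{t-1}^- - x_t) + \frac{1}{2} + \frac{1}{2}\left(w^\top (x_t - c_{t-1}^-)(x_t - c_{t-1}^-)^\top w + w^\top \hat{S}_t^- w\right)$$

if $y_t = 1$; otherwise,

$$\hat{\mathcal{L}}_t(w) = \frac{\lambda}{2}|w|^2 + w^\top (x_t - c_{t-1}^+) + \frac{1}{2} + \frac{1}{2}\left(w^\top (x_t - c_{t-1}^+)(x_t - c_{t-1}^+)^\top w + w^\top \hat{S}_t^+ w\right).$$

We further define $\hat{w}_*$ as the optimal solution that minimizes the loss based on approximate
covariance matrices, i.e.

$$\hat{w}_* = \arg\min_{w} \sum_{t=1}^{T} \hat{\mathcal{L}}_t(w).$$

Based on Lemma 6, the following theorem gives an upper bound for $|\sum_t \hat{\mathcal{L}}_t(\hat{w}_*) -
\sum_t \mathcal{L}_t(w_*)|$.

Theorem 7 Let $r(\{S_t^{\pm}\}_{t=1}^{T}, \lambda)$ be the effective numerical rank for the set of covariance
matrices $S_t^{\pm}$, $t = 1, \ldots, T$, with respect to the regularization parameter $\lambda$. Then, for any
$0 < \delta$ and $0 < \epsilon \le 1/2$, the following hold with probability at least $1 - \delta$

$$|\hat{w}_* - w_*| \le 2 \epsilon |w_*| \qquad (14)$$

$$\left|\sum_t \hat{\mathcal{L}}_t(\hat{w}_*) - \sum_t \mathcal{L}_t(w_*)\right| \le 2 \epsilon \sum_t \mathcal{L}_t(w_*) \qquad (15)$$

provided that

$$r \ge \frac{32\, r(\{S_t^{\pm}\}_{t=1}^{T}, \lambda)}{\epsilon^2} \log \frac{2dT}{\delta}.$$

Proof We first rewrite $\mathcal{L}(w) = \sum_{t=1}^{T} \mathcal{L}_t(w)$, normalized by $T$, as

$$\mathcal{L}(w) = \frac{1}{2} + w^\top a + \frac{1}{2} w^\top (A_1 + A_2) w$$

where

$$a = \frac{1}{T} \sum_{t=1}^{T} \left(I[y_t = 1](c_{t-1}^- - x_t) + I[y_t = -1](x_t - c_{t-1}^+)\right),$$

$$A_1 = \lambda I_d + \frac{1}{T} \sum_{t=1}^{T} \left(I[y_t = 1] S_{t-1}^- + I[y_t = -1] S_{t-1}^+\right),$$

$$A_2 = \frac{1}{T} \sum_{t=1}^{T} I[y_t = 1](x_t - c_{t-1}^-)(x_t - c_{t-1}^-)^\top + \frac{1}{T} \sum_{t=1}^{T} I[y_t = -1](x_t - c_{t-1}^+)(x_t - c_{t-1}^+)^\top.$$

Similarly, we rewrite $\hat{\mathcal{L}}(w) = \sum_{t=1}^{T} \hat{\mathcal{L}}_t(w)$ as

$$\hat{\mathcal{L}}(w) = \frac{1}{2} + w^\top a + \frac{1}{2} w^\top (\hat{A}_1 + A_2) w$$

where

$$\hat{A}_1 = \lambda I_d + \frac{1}{T} \sum_{t=1}^{T} I[y_t = 1] [S_{t-1}^-]^{1/2} R_t R_t^\top [S_{t-1}^-]^{1/2} + \frac{1}{T} \sum_{t=1}^{T} I[y_t = -1] [S_{t-1}^+]^{1/2} R_t R_t^\top [S_{t-1}^+]^{1/2}.$$

The optimal solutions for minimizing $\mathcal{L}(w)$ and $\hat{\mathcal{L}}(w)$ are given, respectively, by

$$w_* = -(A_1 + A_2)^{-1} a \quad \text{and} \quad \hat{w}_* = -(\hat{A}_1 + A_2)^{-1} a.$$

Define $\Delta = I - A_1^{-1/2} \hat{A}_1 A_1^{-1/2}$ and write $\hat{A}_1$ in terms of $\Delta$ as

$$\hat{A}_1 = A_1 - A_1^{1/2} \Delta A_1^{1/2}.$$

Using Lemma 6, it holds that $-\epsilon I_d \preceq \Delta \preceq \epsilon I_d$ with probability at least $1 - \delta$, and therefore

$$(1 - \epsilon)(A_1 + A_2) \preceq \hat{A}_1 + A_2 \preceq (1 + \epsilon)(A_1 + A_2).$$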
Table 1: Benchmark datasets

datasets    #inst     #feat    datasets    #inst     #feat
diabetes    768       8        w8a         49,749    300
fourclass   862       2        kddcup04    50,000    65
german      1,000     24       mnist       60,000    780
splice      3,175     60       connect-4   67,557    126
usps        9,298     256      acoustic    78,823    50
letter      15,000    16       ijcnn1      141,691   22
magic04     19,020    10       epsilon     400,000   2,000
a9a         32,561    123      covtype     581,012   54


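Denote by

$$\Omega = (A_1 + A_2)^{1/2} (\hat{A}_1 + A_2)^{-1} (A_1 + A_2)^{1/2} - I,$$

and according to the previous analysis, we have

$$\|\Omega\|_2 \le \frac{\epsilon}{1 - \epsilon} \le 2 \epsilon$$

for $\epsilon \le 1/2$. Therefore,

$$|\hat{w}_* - w_*| = \left|\left((\hat{A}_1 + A_2)^{-1} - (A_1 + A_2)^{-1}\right) a\right| = \left|(A_1 + A_2)^{-1/2}\, \Omega\, (A_1 + A_2)^{-1/2} a\right| \le 2 \epsilon \left|(A_1 + A_2)^{-1} a\right| \le 2 \epsilon |w_*|$$

and

$$|\hat{\mathcal{L}}(\hat{w}_*) - \mathcal{L}(w_*)| = \frac{1}{2} \left|a^\top \left((\hat{A}_1 + A_2)^{-1} - (A_1 + A_2)^{-1}\right) a\right| = \frac{1}{2} \left|a^\top \left((A_1 + A_2)^{-1/2}\, \Omega\, (A_1 + A_2)^{-1/2}\right) a\right| \le 2 \epsilon \mathcal{L}(w_*).$$

This theorem follows as desired. ■

Proof of Theorem 4 For $|w_*| \le B$, $\mathcal{L}(w_*) \le T L^*$, and $0 < \epsilon \le 1/2$, we have

$$|\hat{w}_*| \le |w_*| + |\hat{w}_* - w_*| \le 2B \qquad (17)$$

from Eq. (14), and we further have

$$\hat{\mathcal{L}}(\hat{w}_*) \le \mathcal{L}(w_*) + |\hat{\mathcal{L}}(\hat{w}_*) - \mathcal{L}(w_*)| \le 2 \mathcal{L}(w_*) \le 2 T L^* \qquad (18)$$

from Eq. (15). Therefore, we have

$$\left|\sum_t \hat{\mathcal{L}}_t(w_t) - \sum_t \mathcal{L}_t(w_*)\right| \le \left|\sum_t \hat{\mathcal{L}}_t(w_t) - \sum_t \hat{\mathcal{L}}_t(\hat{w}_*)\right| + \left|\sum_t \hat{\mathcal{L}}_t(\hat{w}_*) - \sum_t \mathcal{L}_t(w_*)\right|.$$
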
Table 2: Testing AUC (mean±std.) of OPAUC with online algorithms on benchmark
datasets. •/○ indicates that OPAUC is significantly better/worse than the
corresponding method (pairwise t-tests at 95% significance level). Cells marked
[illegible] could not be recovered from the source.

datasets      OPAUC         OAM_seq        OAM_gra        online Uni-Exp  online Uni-Squ
diabetes      .8309±.0350   .8264±.0367    .8262±.0338    [illegible]•    [illegible]
fourclass     .8310±.0251   .8306±.0247    .8295±.0251    [illegible]     [illegible]
german        .7978±.0347   .7747±.0411•   .7723±.0358•   [illegible]     [illegible]
splice        .9232±.0099   .8594±.0194•   .8864±.0166•   [illegible]     [illegible]•
usps          .9620±.0040   .9310±.0159•   .9348±.0122•   [illegible]•    [illegible]•
letter        .8114±.0065   .7549±.0344•   .7603±.0346•   [illegible]     [illegible]•
magic04       .8383±.0077   .8238±.0146•   .8259±.0169•   .8354±.0099•    .8344±.0086•
a9a           .9002±.0047   .8420±.0174•   .8571±.0173•   .9005±.0024     .8949±.0025•
w8a           .9633±.0035   .9304±.0074•   .9418±.0070•   .7693±.0986•    .8847±.0130•
kddcup04      .7912±.0039   .6918±.0412•   .7097±.0420•   .7851±.0050•    .7850±.0042•
mnist         .9242±.0021   .8615±.0087•   .8643±.0112•   .7932±.0245•    .9156±.0027•
connect-4     .8760±.0023   .7807±.0258•   .8128±.0230•   .8702±.0025•    .8685±.0033•
acoustic      .8192±.0032   .7113±.0590•   .7711±.0217•   .8171±.0034•    .8193±.0035
ijcnn1        .9269±.0021   .9209±.0079•   .9100±.0092•   .9264±.0035     .9022±.0041•
epsilon       .9550±.0007   .8816±.0042•   .8659±.0176•   .9488±.0012•    .9480±.0021•
covtype       .8244±.0014   .7361±.0317•   .7403±.0289•   .8236±.0017     .8236±.0020
win/tie/loss                14/2/0         14/2/0         10/6/0          11/5/0


We use Theorem 2 to bound the first term above by combining
Eqs. (17) and (18), and the second term can be bounded by Eq. (15). This completes the
proof as desired. ■

6. Experiments 

We evaluate the performance of OPAUC on benchmark datasets and high-dimensional 
datasets in Sections 6.1 and 6.2, respectively. Then, we study the parameter influence
in Section 6.3.

6.1 Comparison on Benchmark Data

We conduct our experiments on sixteen benchmark datasets^{1,2,3} as summarized in Table 1.
Some datasets have been used in previous studies on AUC optimization, whereas the others
are large ones requiring a one-pass procedure. The features have been scaled to [−1, 1] for
all datasets. Multi-class datasets have been transformed into binary ones by randomly
partitioning classes into two groups, where each group contains the same number of classes.
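The preprocessing described above can be sketched as follows. The function names `scale_features` and `binarize_labels`, the per-column min-max scaling rule, and the handling of constant columns are assumptions made for this illustration; the text does not specify the exact scaling procedure used in the experiments.

```python
import numpy as np

def scale_features(X):
    """Linearly rescale each column to [-1, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return np.where(hi > lo, 2.0 * (X - lo) / span - 1.0, 0.0)

def binarize_labels(y, rng):
    """Randomly split the classes into two equal-sized groups labeled +1 / -1."""
    classes = rng.permutation(np.unique(y))
    positive = classes[: len(classes) // 2]  # half of the classes become +1
    return np.where(np.isin(y, positive), 1, -1)
```

For example, `binarize_labels(np.array([0, 1, 2, 3]), np.random.default_rng(0))` assigns two of the four classes to the positive group and the other two to the negative group.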



1. http://www.sigkdd.org/kddcup/ 

2. http://www.ics.uci.edu/~mlearn/MLRepository.html

3. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/
Table 3: Testing AUC (mean±std.) of OPAUC with batch algorithms on benchmark
datasets. •/○ indicates that OPAUC is significantly better/worse than the
corresponding method (pairwise t-tests at 95% significance level).

datasets      OPAUC         SVM-perf       batch SVM-OR   batch LS-SVM   batch Uni-Log  batch Uni-Squ
diabetes      .8309±.0350   .8325±.0220    .8326±.0328    .8325±.0329    .8330±.0322    .8332±.0323
fourclass     .8310±.0251   .8221±.0381    .8305±.0311    .8309±.0309    .8288±.0307    .8297±.0310
german        .7978±.0347   .7952±.0340    .7935±.0348    .7994±.0343    .7995±.0344    .7990±.0342
splice        .9232±.0099   .9235±.0091    .9239±.0089    .9245±.0092○   .9208±.0107•   .9211±.0107•
usps          .9620±.0040   .9600±.0054•   .9630±.0047○   .9634±.0045○   .9637±.0041○   .9617±.0043
letter        .8114±.0065   .8028±.0074•   .8144±.0064○   .8124±.0065○   .8121±.0061    .8112±.0061
magic04       .8383±.0077   .8427±.0078○   .8426±.0074○   .8379±.0078    .8378±.0073    .8338±.0073•
a9a           .9002±.0047   .9033±.0039    .9009±.0036    .8982±.0028•   .9033±.0025○   .8967±.0028•
w8a           .9633±.0035   .9626±.0042    .9495±.0082•   .9495±.0092•   .9421±.0062•   .9075±.0104•
kddcup04      .7912±.0039   .7935±.0037○   .7903±.0039•   .7898±.0039•   .7900±.0039•   .7926±.0038
mnist         .9242±.0021   .9338±.0022○   .9340±.0020○   .9336±.0025○   .9334±.0021○   .9279±.0021○
connect-4     .8760±.0023   .8794±.0024○   .8749±.0025•   .8739±.0026•   .8784±.0026○   .8760±.0024
acoustic      .8192±.0032   .8102±.0032•   .8262±.0032○   .8210±.0033○   .8253±.0032○   .8222±.0031○
ijcnn1        .9269±.0021   .9314±.0025○   .9337±.0024○   .9320±.0037○   .9282±.0023○   .9038±.0025•
epsilon       .9550±.0007   .8640±.0049•   .8643±.0053•   .8644±.0050•   .8647±.0150•   .8653±.0073•
covtype       .8244±.0014   .8271±.0011○   .8248±.0013    .8222±.0014•   .8246±.0010    .8242±.0012
win/tie/loss                4/6/6          4/6/6          6/4/6          4/6/6          6/8/2


In addition to the state-of-the-art online AUC approaches OAM_seq and OAM_gra (Zhao et al.,
2011), we also compare with:

• online Uni-Exp: An online learning algorithm which optimizes the (weighted) univariate
exponential loss (Kotlowski et al., 2011);

• online Uni-Squ: An online learning algorithm which optimizes the (weighted) univariate
square loss;

• SVM-perf: A batch learning algorithm which directly optimizes AUC (Joachims, 2005);

• batch SVM-OR: A batch learning algorithm which optimizes the pairwise hinge loss
(Joachims, 2006);

• batch LS-SVM: A batch learning algorithm which optimizes the pairwise square
loss;

• batch Uni-Log: A batch learning algorithm which optimizes the (weighted) univariate
logistic loss (Kotlowski et al., 2011);

• batch Uni-Squ: A batch learning algorithm which optimizes the (weighted) univariate
square loss.
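The compared methods differ mainly in whether their surrogate loss is defined over positive-negative pairs or over single instances. The following sketch contrasts the two on plain lists of scores. The function names are hypothetical, and the per-class averaging used for the weighted univariate loss is an assumption made here for illustration; it is not taken from Kotlowski et al. (2011).

```python
def auc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs ranked correctly (ties count as 1/2)."""
    correct = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


def pairwise_square_loss(pos_scores, neg_scores):
    """Mean of (1 - (s_pos - s_neg))^2 over all positive-negative pairs."""
    total = sum((1.0 - (sp - sn)) ** 2 for sp in pos_scores for sn in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))


def univariate_square_loss(pos_scores, neg_scores):
    """Per-instance loss (1 - y * s)^2, averaged within each class, then summed.

    Averaging within each class is one way to weight by class ratio (assumed).
    """
    pos_part = sum((1.0 - s) ** 2 for s in pos_scores) / len(pos_scores)
    neg_part = sum((1.0 + s) ** 2 for s in neg_scores) / len(neg_scores)
    return pos_part + neg_part
```

The pairwise loss touches every positive-negative pair, which is why pairwise methods cost more per example than univariate ones unless the pair sum can be summarized.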
[Figure 1 plot omitted. Legend: OPAUC, OAM_seq, OAM_gra, online Uni-Exp, online Uni-Squ; x-axis: datasets.]

Figure 1: Comparison of the running time (in seconds) of OPAUC and online learning
algorithms on benchmark datasets. Note that the y-axis is in log scale.



Table 4: High-dimensional datasets 



datasets      #inst      #feat        datasets        #inst      #feat
sector        9,619      55,197       news20.binary   19,996     1,355,191
sector.lvr    9,619      55,197       rcv1v2          23,149     47,236
news20        15,935     62,061       ecml2012        456,886    98,519



All experiments are performed with Matlab 7 on a node of a computational cluster with
16 CPUs (Intel Xeon Dual Core 3.0GHz) running RedHat Linux Enterprise 5 with 48GB
main memory. For batch algorithms, due to memory limits, 8,000 training examples are
randomly chosen if the training data size exceeds 8,000, whereas only 2,000 training examples
are used for the epsilon dataset because of its high dimension.

Five-fold cross-validation is executed on training sets to decide the learning rate η_t ∈
2^[-12:10] for online algorithms, the regularization parameter λ ∈ 2^[-10:2] for OPAUC and λ ∈
2^[-10:10] for batch algorithms. For OAM_seq and OAM_gra, the buffer sizes are fixed to be 100
as recommended in (Zhao et al., 2011). For univariate approaches, the weights (i.e., class
ratios) are chosen as done in (Kotlowski et al., 2011).
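The search grids above can be written down explicitly. The sketch below assumes that 2^[a:b] denotes the powers of two with integer exponents from a to b inclusive. The selection helper `pick_best` and its `cv_auc` callback are hypothetical stand-ins for the five-fold cross-validation run on the training set.

```python
def power_of_two_grid(lo_exp, hi_exp):
    """Return [2^lo_exp, ..., 2^hi_exp] with integer exponents (assumed step of 1)."""
    return [2.0 ** k for k in range(lo_exp, hi_exp + 1)]


ETA_GRID = power_of_two_grid(-12, 10)           # learning rate, online algorithms
LAMBDA_GRID_OPAUC = power_of_two_grid(-10, 2)   # regularization, OPAUC
LAMBDA_GRID_BATCH = power_of_two_grid(-10, 10)  # regularization, batch algorithms


def pick_best(grid, cv_auc):
    """Return the grid value with the highest cross-validated AUC.

    `cv_auc` is a caller-supplied function mapping a parameter value to a score.
    """
    return max(grid, key=cv_auc)
```

Under this reading the grids contain 23, 13 and 21 candidate values, respectively.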

The performances of the compared methods are evaluated by five trials of 5-fold cross-validation,
where the AUC values are obtained by averaging over these 25 runs. Table 2 shows
that OPAUC is significantly better than the other four online algorithms OAM_seq, OAM_gra,
online Uni-Exp and online Uni-Squ, particularly for large datasets. The win/tie/loss counts
show that OPAUC is clearly superior to these online algorithms, as it wins most of the time
and never loses. Table 3 shows that OPAUC is highly competitive with the other five
batch learning algorithms; this is impressive because these batch algorithms require storing
the whole training dataset whereas OPAUC does not store training data. Additionally,
batch LS-SVM, which optimizes the square loss, is comparable to the other batch algorithms,
verifying our argument that square loss is effective for AUC optimization.
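The win/tie/loss rows can be derived from the per-run AUC values roughly as sketched below. This is an illustration only: it assumes a paired two-sided t-test over the 25 runs of each dataset, which is one reading of the "pairwise t-tests" in the table captions, and the function names are hypothetical. The hard-coded critical value applies only to 25 runs (24 degrees of freedom).

```python
import math

# Two-sided 95% critical value of Student's t with 24 degrees of freedom (25 runs).
T_CRIT_DF24 = 2.064


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        # All differences identical: no evidence if they are zero, else unbounded.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)


def compare(ours, theirs, t_crit=T_CRIT_DF24):
    """Return 'win', 'tie' or 'loss' for our method against another one."""
    t = paired_t_statistic(ours, theirs)
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"


def win_tie_loss(ours_by_dataset, theirs_by_dataset):
    """Count outcomes over datasets and format them as 'win/tie/loss'."""
    counts = {"win": 0, "tie": 0, "loss": 0}
    for name, ours in ours_by_dataset.items():
        counts[compare(ours, theirs_by_dataset[name])] += 1
    return "{win}/{tie}/{loss}".format(**counts)
```

Each argument of `win_tie_loss` maps a dataset name to its list of 25 AUC values, one per cross-validation run.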
Table 5: Testing AUC (mean±std.) of OPAUCr with online methods on high-dimensional
datasets. •/◦ indicates that OPAUCr is significantly better/worse than the
corresponding method (pairwise t-tests at 95% significance level). 'N/A' means that
no result was obtained after running for 10^6 seconds (about 11.6 days).



datasets        sector         sector.lvr     news20         news20.binary  rcv1v2         ecml2012
OPAUCr          .9292±.0081    .9962±.0011    .8871±.0083    .6389±.0136    .9686±.0029    .9828±.0008
OAM_seq         .9163±.0087•   .9965±.0064    .8543±.0099•   .6314±.0131•   .9686±.0026    N/A
OAM_gra         .9043±.0100•   .9955±.0059    .8346±.0094•   .6351±.0135•   .9604±.0025•   .9657±.0055•
online Uni-Exp  .9215±.0034•   .9969±.0093    .8880±.0047    .6347±.0092•   .9822±.0042◦   .9820±.0016•
online Uni-Squ  .9203±.0043•   .9669±.0260    .8878±.0066    .6237±.0104•   .9818±.0014    .9530±.0041•
OPAUC_f         .6228±.0145•   .6813±.0444•   .5958±.0118•   .5068±.0086•   .6875±.0101•   .6601±.0036•
OPAUC_rp        .7286±.0619•   .9863±.0258•   .7885±.0079•   .6212±.0072•   .9353±.0053•   .9355±.0047•
OPAUC_pca       .8853±.0114•   .9893±.0288•   .8878±.0115    N/A            .9752±.0020◦   N/A



We also compare the running time of OPAUC and the online algorithms OAM_seq,
OAM_gra, online Uni-Exp and online Uni-Squ; the average CPU time (in seconds) is
shown in Figure 1. As expected, online Uni-Squ and online Uni-Exp take the least time
because they optimize a single-instance (univariate) loss, whereas the other algorithms
work by optimizing a pairwise loss. On most datasets, the running time of OPAUC is
competitive with OAM_seq and OAM_gra, except on the mnist and epsilon datasets, which have the
highest dimension in Table 1.

6.2 Comparison on High-Dimensional Data 

Next, we study the performance of using low-rank matrices to approximate the full
covariance matrices, denoted by OPAUCr. Six datasets (footnotes 4 and 5) with nearly or more than 50,000
features are used, as summarized in Table 4. The news20.binary dataset contains two classes,
unlike the news20 dataset. The original news20 and sector are multi-class datasets; in
our experiments, we randomly group the multiple classes into two meta-classes, each
containing the same number of classes, and we also use the sector.lvr dataset, which regards
the largest class as positive and the union of the other classes as negative. The original
ecml2012 and rcv1v2 are multi-label datasets; in our experiments, we only consider the
label with the largest population, and remove the features in the ecml2012 dataset that take zero
values for all instances.

Besides the online algorithms OAM_seq, OAM_gra, online Uni-Exp and online Uni-Squ,
we also evaluate three variants of OPAUC to study the effectiveness of approximating full
covariance matrices with low-rank matrices:

• OPAUC_f: Randomly selects 1,000 features and then works with full covariance
matrices;



4. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/ 

5. http://www.ecmlpkdd2012.net/discovery-challenge
Figure 2: Comparison of the running time on high-dimensional datasets. Full black columns
indicate that no results were returned within the maximal running time.



• OPAUC_rp: Projects into a 1,000-dim feature space by Random Projection, and then
works with full covariance matrices;

• OPAUC_pca: Projects into a 1,000-dim feature space obtained by Principal Component
Analysis, and then works with full covariance matrices.
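The first two variants reduce the dimension before any covariance matrix is formed. A minimal sketch of both reductions is given below, assuming a dense Gaussian projection matrix with entries of variance 1/k; the text does not specify which random projection was used, and both function names are hypothetical. The PCA variant is omitted because it needs an eigendecomposition routine.

```python
import random


def select_random_features(rows, k, seed=0):
    """Keep k feature columns chosen uniformly at random (OPAUC_f-style reduction)."""
    d = len(rows[0])
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(d), k))
    return [[row[j] for j in keep] for row in rows]


def gaussian_random_projection(rows, k, seed=0):
    """Map d-dim rows to k dims with a random Gaussian matrix (OPAUC_rp-style).

    The d x k matrix has i.i.d. N(0, 1/k) entries; this scaling is an assumption.
    """
    d = len(rows[0])
    rng = random.Random(seed)
    proj = [[rng.gauss(0.0, 1.0) / k ** 0.5 for _ in range(k)] for _ in range(d)]
    return [
        [sum(row[i] * proj[i][j] for i in range(d)) for j in range(k)]
        for row in rows
    ]
```

Both functions use dense lists for clarity; for the sparse, very high-dimensional datasets in Table 4 a sparse representation would be needed in practice.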

Similar to Section 6.1, five-fold cross-validation is executed on training sets to decide the
learning rate η_t ∈ 2^[-12:10] and the regularization parameter λ ∈ 2^[-10:2]. Due to memory
and computational limits, the buffer sizes are set to 50 for OAM_seq and OAM_gra, and the rank
r of OPAUCr is also set to 50. The performances of the compared methods are evaluated
by five trials of 5-fold cross-validation, where the AUC values are obtained by averaging
over these 25 runs.

The comparison results are summarized in Table 5 and the average running time is
shown in Figure 2. These results clearly show that our approximate OPAUCr approach
is superior to the other compared methods. Compared with OAM_seq and OAM_gra, the
running time costs are comparable whereas the performance of OPAUCr is better. Online
Uni-Squ and online Uni-Exp are more efficient than OPAUCr because they optimize a univariate loss,
but the performance of OPAUCr is highly competitive or better, except on rcv1v2, the only
dataset with fewer than 50,000 features. Compared with the three variants, OPAUC_f and
OPAUC_rp are more efficient, but perform much worse. OPAUC_pca achieves a
better performance on rcv1v2, but it is worse on datasets with more features; in particular,
on the two datasets with the largest number of features, OPAUC_pca cannot return results
even after running for 10^6 seconds (almost 11.6 days). Our approximate OPAUCr approach
is significantly better than all the other methods (if they return results) on the two datasets
Figure 3: Influence of stepsize η_t

Figure 4: Influence of regularization parameter λ



with the largest number of features: news20.binary with more than 1 million features, and
ecml2012 with nearly 100 thousand features. These observations validate the effectiveness
of the low-rank approximation used by OPAUCr for handling high-dimensional data.

6.3 Parameter Influence 

We study the influence of parameters in this section. Figure 3 shows that the stepsize η_t should
not be set to values bigger than 1, whereas there is a relatively wide range [2^-12, 2^-4]
in which OPAUC achieves good results. Figure 4 shows that OPAUC is not sensitive to the
value of the regularization parameter λ, provided that it is not set to a big value. Figure 5
shows that OPAUCr is not sensitive to the value of the rank r, and it works well even when
r = 50; this agrees with our theoretical result that a relatively small r suffices to give a good
approximation. Figure 6 studies the influence of the number of iterations for
OPAUC, OAM_seq and OAM_gra; it can be observed that OPAUC converges faster than
the other two algorithms, which is consistent with our theoretical argument.
Figure 5: Influence of rank r

7. Conclusion

In this paper, we study one-pass AUC optimization that requires going through the training
data only once, without storing the entire dataset. Here, a big challenge lies in the fact that
AUC is measured by a sum of losses defined over pairs of instances from different classes.
We propose the OPAUC approach, which employs a square loss and requires storing
only the first- and second-order statistics of the received training examples. A nice property of
OPAUC is that its storage requirement is O(d^2), where d is the dimension of the data,
independent of the number of training examples. To handle high-dimensional data, we develop
an approximation strategy using low-rank matrices. The effectiveness of our proposed
approach is verified both theoretically and empirically. In particular, the performance of
OPAUC is significantly better than state-of-the-art online AUC optimization approaches,
and even highly competitive with batch learning approaches; the approximate OPAUC is
significantly better than all compared methods on large datasets with one hundred thousand or
even more than one million features. An interesting future issue is to develop one-pass AUC
optimization approaches not only with a performance comparable to batch approaches, but
also with an efficiency comparable to univariate loss optimization approaches.
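To illustrate why first- and second-order statistics are enough under a square loss, the sketch below keeps a running mean and second-moment matrix of one class and checks that the gradient computed from these statistics equals the gradient computed from all stored pairs. It is not the authors' implementation: the loss form (an L2 penalty with factor λ/2 plus the mean of half-squared pairwise losses against the opposite class), the class `ClassStats` and both gradient functions are assumptions made for illustration, and only the case of a positive example against stored negatives is shown.

```python
class ClassStats:
    """Running mean and second-moment matrix of one class: O(d^2) memory."""

    def __init__(self, d):
        self.n = 0
        self.mean = [0.0] * d
        self.second = [[0.0] * d for _ in range(d)]  # running average of x x^T

    def update(self, x):
        """Fold one example into the statistics; the example is not stored."""
        self.n += 1
        r = 1.0 / self.n
        for i in range(len(x)):
            self.mean[i] += r * (x[i] - self.mean[i])
            for j in range(len(x)):
                self.second[i][j] += r * (x[i] * x[j] - self.second[i][j])

    def covariance(self):
        d = len(self.mean)
        return [
            [self.second[i][j] - self.mean[i] * self.mean[j] for j in range(d)]
            for i in range(d)
        ]


def grad_from_stats(w, x, neg_stats, lam):
    """Gradient for a positive x using only the negatives' mean and covariance."""
    d = len(w)
    cov = neg_stats.covariance()
    diff = [x[i] - neg_stats.mean[i] for i in range(d)]
    dw = sum(diff[i] * w[i] for i in range(d))
    return [
        lam * w[i] - diff[i] + diff[i] * dw + sum(cov[i][j] * w[j] for j in range(d))
        for i in range(d)
    ]


def grad_from_pairs(w, x, negatives, lam):
    """Same gradient, computed the expensive way from every stored negative."""
    d = len(w)
    g = [lam * wi for wi in w]
    for xn in negatives:
        diff = [x[i] - xn[i] for i in range(d)]
        resid = 1.0 - sum(w[i] * diff[i] for i in range(d))
        for i in range(d):
            g[i] -= resid * diff[i] / len(negatives)
    return g
```

The two gradients agree up to floating-point error because the average of (x - x_i)(x - x_i)^T over the negatives x_i equals (x - c)(x - c)^T plus the negatives' covariance, where c is their mean; this identity is what lets the pair sum be replaced by d-by-d statistics.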